Citation classics in general medical journals: assessing the quality of evidence; a systematic review

Aim: This review provides a comprehensive overview of the 100 most cited studies in general medical journals and evaluates whether citations predict the quality of a scientific article.
Background: The number of citations is commonly used as a measure of the quality and impact of a scientific article. However, it is often argued that the number of citations is in fact a poor indicator of true quality, as it can be influenced by factors such as current trends.
Methods: This review was conducted in line with the PRISMA guidelines. The Journal Citation Reports (JCR) within InCites allowed the evaluation and comparison of articles published in general medical journals, using far-reaching citation data drawn from scholarly and technical journals and conference proceedings. All steps of the review were performed in duplicate and conflicts were resolved through consensus.
Results: The 100 most cited articles published from 1963 until the end of 2018 were identified. The number of citations ranged from 4021 to 31853. Most of the articles were published in the 2000s, followed by the 1990s, 1980s, 1970s and 1960s, respectively. All of the articles were published in five journals. There were 50 studies at level II, 28 at level V, 10 at level IV, 7 at level III, and 5 at level I.
Conclusion: This systematic review provides an overview of the most cited articles published in general medical journals. The number of citations provides an indication of the quality of evidence. However, researchers and clinicians should use standardized assessment tools rather than rely solely on the number of citations to judge the quality of published articles.


Introduction
The term Evidence-Based Medicine (EBM) was first coined by Guyatt in 1991. It refers to the meticulous, [...] purpose of the report was to develop recommendations on the periodic health examination based on evidence available in the medical literature. The quality of the evidence was determined by the degree to which it reflected the true theoretical effect of the intervention. The level of evidence (LOE) system was further defined by Sackett in 1989 (3). The early hierarchy systems considered systematic reviews and randomised controlled trials (RCTs) to have the highest LOE, while case reports and expert opinion had the lowest LOE (4). This is because RCTs are designed to minimise bias and systematic error, whereas expert opinion is frequently biased by the author's experience and the lack of control.
Over the past 20 years, the volume of published scientific literature has increased exponentially, and identifying relevant information has become a complex task for the individual investigator (5). Thus, researchers are encouraged to endorse the core principles of the hierarchy of evidence to answer definitive research questions. A citation is the acknowledgment one gives to a published or unpublished source. Citation count is regarded as a useful tool for obtaining a quantitative measure of the utilisation and contribution of a particular published paper. It also reflects the impact of the author's intellectual capability (6). However, whether the number of citations reflects methodological quality remains an open question.
Recently, several attempts have been made to identify and analyse highly cited articles, allowing the reader to understand their quality and characteristics (7)(8)(9). This bibliometric study identifies citation classics published in general medical journals, applies the empirical data to establish a quantitative assessment of the academic output, and demonstrates the extent to which the number of citations can predict quality. This will allow us to reveal whether the number of citations can be used as an objective criterion for faculty hiring and performance evaluation. Furthermore, controversies concerning the technical limitations of citations, database selectivity, time- and discipline-related bias, publication type bias, authorship merits, and motivations for citing are addressed.

Methods
The reporting of this systematic review conforms to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (10).

Information sources
The Web of Science was used to provide comprehensive citation data for articles published in general medical journals. The following databases were searched through the Web of Science: Medline, Web of Science Core Collection, BIOSIS Previews, and SciELO Citation Index.
The Journal Citation Reports (JCR), within the Web of Science, allowed the evaluation and comparison of articles published in broad-ranging medical journals, using far-reaching citation data drawn from scholarly and technical journals and conference proceedings. The JCR allowed the following information to be extracted:
Journal-related data
1-Bibliographic information: publisher, title abbreviation, language and ISSN.
2-Subject categories.
Basic citation-related data
1-The number of articles published during that year and the number of citations that the articles have accrued.
Detailed citation-related data
1-The number of times an article was cited by later published articles during the year.
2-The number of citations made from articles published in the journal during each of the most recent 10 years.
3-The number of times articles published in a specific journal were cited by other journals during each of the most recent 10 years.
Several measures can be derived from these data, including the impact factor, immediacy index, quartiles and JIF percentiles, and cited and citing half-life.
The Categories by Rank view within the JCR was used to list all subject categories, ranked by the number of journals. Journals within the subject category `Medicine, General & Internal` were included, together with the data from which that year's metrics were calculated. To strengthen the credibility of the review, peer-reviewed open access journals were included in the study, with no language restriction. The search was conducted on 16 December 2018 by two reviewers with experience in bibliometrics, who independently reviewed the journals and articles and abstracted the data. Any disagreements were resolved through consensus.

Eligibility criteria
The 100 most-frequently cited articles published in journals, within the subject category `Medicine, General & Internal`, were identified. Articles were included in data extraction if they were published in peer-reviewed journals, covering the full spectrum of the medical sciences. Journals publishing mainly clinical research, in internal medicine and related subspecialities, were excluded from the study (Figure 1).
The institution and country of origin were defined based on the affiliation provided for the first author. In the case of group authorship, the affiliation was taken to be that of the corresponding author. Where a paper was published in more than one journal, only the most cited version was included in the study. The Oxford Centre for Evidence-Based Medicine (OCEBM) classification was used to analyse the level of evidence of articles (11,12).

Levels of evidence
Level 1: systematic review of randomised controlled trials/ systematic review of inception cohort studies/ systematic review of nested case-control studies/ systematic review of cross-sectional studies with consistently applied reference standard and blinding/ local and current random sample surveys or censuses.
Level 2: systematic review of surveys that allow matching to local circumstances/ cross-sectional studies with consistently applied reference standard and blinding/ inception cohort studies/ randomised controlled trials/ observational study with dramatic effects.
Level 3: Local non-random sample/ nonconsecutive studies/ studies without consistently applied reference standards/ cohort study or control arm of randomized trials/ non-randomized controlled cohort/follow up studies.
Level 4: case-series/ case-control/ poor or nonindependent reference standard/ poor quality prognostic cohort study/ historically controlled studies.

Data synthesis and analysis
Statistical analysis was performed using only the data from studies in the extracted subset. R (version 3.5) and the tidyverse collection of packages were used for the statistical and graphical analyses. The Shapiro-Wilk test was applied to assess whether the data were normally distributed. The Pearson correlation coefficient (r) was employed to measure the degree of association between linearly related variables. Two non-parametric tests were also used: the Kendall rank correlation, to measure the strength of dependence between variables, and the Kruskal-Wallis test, to assess differences in a continuous dependent variable across the groups of a categorical independent variable.
The Pearson and Kendall rank correlation results were expressed on a scale from -1 to 1, with -1 being a perfect negative correlation and 1 a perfect positive correlation. Probability values were two-tailed and the threshold for significance was set at p<0.05.
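As an illustration of the two correlation measures described above, the following is a minimal sketch in plain Python rather than the R code used for the analysis. The function names and the example values for article age and citation count are hypothetical and are not taken from the study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b, which adjusts for ties)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from every term
            elif dx == 0:
                ties_x += 1  # tied on x only
            elif dy == 0:
                ties_y += 1  # tied on y only
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + ties_x) * (untied + ties_y))

# Hypothetical example: article age in years and total citations
ages = [5, 10, 15, 20, 25]
citations = [4500, 5200, 6100, 5800, 9000]
print(pearson_r(ages, citations))
print(kendall_tau_b(ages, citations))  # 0.8 (9 concordant pairs, 1 discordant)
```

Both functions return a value between -1 and 1. The Kendall coefficient depends only on the ordering of the values, which is why it is the more suitable choice for non-normally distributed variables.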

Patient and Public Involvement:
Patients and the public were not involved in this study.

Results

Citation count and density
The total number of citations ranged from 4021 to 31853. The mean number of citations was 6179 (normally distributed data). Seven articles received over 10000 citations, and more than half of the articles had over 5000 citations. The citation density (the mean number of citations per year, i.e. the total number of citations divided by the number of years since publication of the article) varied from 112 to 2258. The median citation density was 314 (non-normally distributed data) and over half of the articles had a density of over 300, as shown in Table 1.
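The citation density defined above can be sketched in a few lines of Python. The function name, the reference year of 2018 (assumed here from the date of the search) and the three example records are illustrative assumptions, not data from this study.

```python
from statistics import median

def citation_density(total_citations, publication_year, reference_year=2018):
    """Mean number of citations per year since publication."""
    years = reference_year - publication_year
    if years <= 0:
        raise ValueError("reference_year must be later than publication_year")
    return total_citations / years

# Hypothetical records: (total citations, publication year)
records = [(6000, 1998), (9000, 2008), (4500, 1988)]
densities = [citation_density(c, y) for c, y in records]
print(densities)          # [300.0, 900.0, 150.0]
print(median(densities))  # 300.0
```

The median is used to summarise the densities because, as noted above, the density data were not normally distributed.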

Year of publication
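Most of the citation classics were published in the 2000s, followed by the 1990s, 1980s, 1970s and 1960s (Table 1).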

Journals publishing the citation classics
The citation classics were published in 5 different journals; these were predominantly comprehensive medical journals, led by the New England Journal of Medicine with 57 articles, followed by the Lancet with 21 articles, the Journal of the American Medical Association with 17 articles, the British Medical Journal with 4 articles and PLOS Medicine with 1 article. The impact factor of the journals ranged from 11.675 for PLOS Medicine to 79.260 for the New England Journal of Medicine. Further, 95% of the articles were published in journals with an impact factor of 47.661 or higher. Table 2 lists the journals in which the citation classics were published.

Authorship, country of origin and institutions
The majority of the citation classics were produced by three or more authors (85%). With regard to individual contributions, Bland JM was the author of the most cited publication, with 31853 citations. The second most cited publication, with 20319 citations, was published by Moher D, whose name appeared in three articles within the 100 most cited list. Randle PJ authored the publication with the lowest number of citations (4021 citations). Memish ZA contributed to three of the 100 most cited articles (14614 citations), followed by Altman DG (36109 citations), Ross R (21335 citations), National Cancer Institute of Canada Clinical Trials Group (13218 citations), Folkman J (12300 citations) and Flegal KM (10395 citations), each of whom contributed to two articles. The citation classics originated from 14 different countries. A total of 63 articles were published by authors from the USA. The United Kingdom was the second most productive country with 16 articles published, followed by Canada with 8 articles (Figure 2).
Articles originating from the USA had the largest total number of citations (363831), followed by the UK (126463) and Canada (55497). The highly cited articles originated from a total of 61 institutions, of which 13 had two or more articles in the citation classics list. The leading institutions were Harvard University (10 articles), McMaster University (7 articles), and the Centers for Disease Control and Prevention (6 articles) (Table 1, Figure 2, Figure 3).

Citation classics' classification and fields of medicine
Amongst the articles extracted, 87% were original articles and the remaining 13% were review articles. All 87 original articles were published in 4 journals, led by the New England Journal of Medicine (53 articles). The articles focused on 19 different fields. Cardiology was the most common speciality, with 28 articles (167256 citations). Articles discussing oncology (19 articles, 105574 citations), public health (13 articles, 60206 citations), endocrinology (10 articles, 69400 citations) and quality of reporting (6 articles, 62455 citations) were also represented. The two least cited fields were radiology (4713 citations) and alternative medicine (4448 citations). Further topics of interest are listed in Table 1 and Table 2.

Levels of evidence and methodologies
The 100 papers had a wide range of evidence levels. There were 50 studies at level II, 28 at level V, 10 at level IV, 7 at level III and 5 at level I. Level I received the highest overall citation number of 305242 citations, followed by level II at 191092 citations, level III at 50903 citations, level IV at 37644 citations, and level V at 32996 citations.
For each level of evidence, a large proportion of publications were young (less than 30 years old) and had fewer than 5000 citations. Level II included the publication with the maximum number of citations, which was under 30 years old. Likewise, levels I and V also included publications under 30 years old with a large number of citations. Levels IV and V were the only ones with publications over 40 years old that were still receiving citations. These distributions were skewed, and hence the median was used as the measure of central tendency; it is shown as a red dot near the centre of the widest section of each plot (Figure 4).
Of the articles published in the 1960s, 1 article was of level I and 1 was of level V. All 4 articles published in the 1970s were of level V. During the 1990s, authors published 15 level II, 5 level III, 5 level V, 4 level IV, and 2 level I articles. The 2000s contributed 34 level II, 14 level V, 5 level IV, 2 level III, and 3 level I articles. Interestingly, for levels I and IV, there are no publications younger than 12 years. On the other hand, for levels II, III and V there are publications as young as 4 years old. Only level II has a publication with citations in the region of 30000. From the plots, it appears that a great number of level II publications were published 12-30 years ago (Figure 5, Table 1).

Statistical associations between level of evidence, citation number, density, and age
For each level of evidence, the Shapiro-Wilk test for the assessment of normality revealed normally distributed data for citations (p>0.05) and non-normally distributed data for density (p<0.05). The Kruskal-Wallis test indicated no significant difference across the levels of evidence in the distribution of citations (p=0.3936), density (p=0.1637), or age (p=0.2904).
Based on the plots of citations and density over age, we can make assumptions about the type of evidence presented and the effect these articles have on the field. However, since the correlation measures for citations were closer to 0 than to 1, we cannot be sure whether the type of evidence is indicative of the impact (citation number).
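For readers unfamiliar with the Kruskal-Wallis test reported above, the following plain-Python sketch shows how the H statistic is computed from ranks. It is not the R routine used for the analysis. The function names and the citation counts are hypothetical, no tie correction is applied (statistical packages typically apply one), and the p-value uses the large-sample chi-square approximation in a closed form that holds only for even degrees of freedom.

```python
import math
from collections import defaultdict

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H statistic for a dict {group label: list of values}, no tie correction."""
    labels, values = [], []
    for label, observations in groups.items():
        labels.extend([label] * len(observations))
        values.extend(observations)
    n = len(values)
    rank_sums = defaultdict(float)
    for label, rank in zip(labels, average_ranks(values)):
        rank_sums[label] += rank
    weighted = sum(rank_sums[g] ** 2 / len(groups[g]) for g in groups)
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Hypothetical citation counts grouped by level of evidence
groups = {
    "Level I": [5200, 6100, 7400],
    "Level II": [4300, 4800, 9100],
    "Level V": [4100, 4500, 5000],
}
h = kruskal_wallis_h(groups)
p = chi2_sf_even_df(h, df=len(groups) - 1)  # df = number of groups - 1
print(h, p)
```

With the five levels of evidence compared in this study, the degrees of freedom would be 4, which is also even. A p-value above 0.05, as reported for citations, density and age, means the test gives no evidence that the distributions differ between levels.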

Discussion
Citation analysis generates a large body of statistical material, providing an insight into scientific trends and sociological diversity. In this study, we used citation indices to assess the quality of published articles. They are indeed tools for evaluation; however, whether further evaluation tools are needed remains an issue for further research (13). Although most scientific papers usually reach their peak citation rate within 3 years of publication, citation indices are time-specific (14). Our examination of citation classics published in general medical journals revealed seminal contributions, including some on transiently popular topics. While older seminal articles would be expected to receive a greater number of citations than more recent articles, 89% of the so-called best sellers had been published in the 1990s and the 2000s. This may be because key concepts become universally accepted and, as a result, are no longer cited.
The Journal Impact Factor (JIF) reflects the citation rate of the average published article in a particular journal over two years (15). The five journals in which the citation classics were published were all within the top quartile. The distribution of citations across articles within a journal is not uniform; most citations within a journal come from a minority of the published articles. In spite of the weak relationship between JIF and article citation counts, editors-in-chief of high impact journals tend to accept high quality articles to maintain and develop their journal's profile and reputation. This implies that the JIF is a useful tool in assessing the quality of published articles.
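The point that a journal-level average can be dominated by a few highly cited articles can be illustrated with a short Python sketch. The two-year ratio follows the usual definition of the JIF (citations received in a given year by items published in the two preceding years, divided by the number of citable items published in those two years); the function name and all numbers are hypothetical.

```python
from statistics import mean, median

def journal_impact_factor(citations_received, citable_items):
    """Citations this year to items from the previous two years, per citable item."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations_received / citable_items

# Hypothetical journal: 1200 citations to 400 citable items
print(journal_impact_factor(1200, 400))  # 3.0

# Hypothetical per-article citation counts with one highly cited article
article_citations = [0, 1, 1, 2, 3, 5, 8, 120]
print(mean(article_citations))    # 17.5
print(median(article_citations))  # 2.5
```

In this example the mean is seven times the median, so an average-based indicator says little about the typical article in the journal.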
Most authors of the top-cited papers are established leaders in their field. Robert Merton observed that better-known scientists tend to receive more credit than their less well-known counterparts for the same achievements. Thus, authors with a significant reputation and record of publishing are cited more readily. This may be due to the fact that well-established authors tend to write quality papers and tend to present them in highly ranking journals, which are widely distributed and indexed by major abstracting services (16).
The citation classics originated from 61 institutions in 14 countries and covered 19 disciplines. The number of citations that different countries, institutions and disciplines accrue depends on factors other than quality and originality. It is well established that the citation rate varies across disciplines (14). Disciplines with longer turnover times are the most affected by the time lag. This, in turn, can also affect various institutions and countries (17,18). The time-associated variations constitute only a part of the citation patterns. Categories with more published articles and funding tend to receive more citations. The scope of the discipline might be another factor that accounts for the difference. For instance, many scientists outside the field of cardiology may be citing cardiology papers. This leads to an increased number of citations beyond what we expect based on bibliometric indicators (the field of cardiology constituted 28% of the citation classics). The citation pattern in other fields, such as radiology and alternative medicine (the fields with the lowest ranking citation classics), is significantly narrower, as only the people in these fields cite one another (19).
One sociological aspect was language bias. It was noted that non-English peer-reviewed materials did not make it into the list of citation classics. Citing non-English or non-Roman script papers is uncommon. A journal editor once stated "what is useful to readers, who may want to find, read and even translate referenced articles". We would like to raise some questions for the readers of this article. Would an article with references to non-English articles be rejected by certain journals just because it has cited non-English articles? Would editors and reviewers of certain journals ask for those references to be excluded just because they could not find a reviewer who can verify them? Although English is now considered the international language of science, data are nowadays easily accessible in the non-English literature, and credit should also be given to our colleagues in the non-English-speaking world when deemed necessary (20).
Of the citation classics, 85% had three or more authors. In applied citation analysis, multi-authored papers are generally more highly cited than single-authored papers. This may be due to multi-authored articles attracting a variety of practical and intellectual proficiencies and thus presenting a greater diversity of ideas and data sets (21).
Certain types of articles are bound to be cited more frequently. It is well recognised that various study designs correspond to different levels of evidence, with systematic reviews, meta-analyses, and RCTs providing the highest quality of evidence, and case reports and expert opinions offering the lowest quality of evidence (4). Amongst the highly cited articles, clinical studies (92 articles) were far more common than pre-clinical studies (8 articles). According to the classification devised by the Oxford Centre for Evidence-Based Medicine, most articles had a high level of evidence. Levels of evidence I and II constituted almost 50% of the total number of citations gained by all articles. On the other hand, levels IV and V constituted only 11% of the total number of citations. The regression lines in this study revealed that a paper of level of evidence I or II will experience a strong positive increase in citations against the age of the article. A weaker positive correlation was noted for both levels III and IV. An article of level V, on the other hand, will experience a decline in citations against age. The mean yearly citation rate has increased remarkably for articles of level I.
We suggest that the high representation of high level of evidence studies does not imply that they are the only studies being performed; rather, they have been cited more frequently than low level of evidence articles. For instance, in line with the plots, the last 30 years have seen an increase in the number of published level II studies (RCTs). This might be attributed to the improved methodological quality of protocols and the increasing rate of technological innovation, allowing for more randomised comparisons. This has led RCTs to become a reliable and robust source of evidence in healthcare interventions, and thus to gain popularity amongst researchers (22).
Our recommendations:
1-The ISI should be the definitive scientific citation indexing service since it lists a large fraction of all published articles, covering broad disciplines and individual specialities. The ISI indexing service also facilitates comparisons, bibliographic coupling (identification of related articles based on common citations) and identification of co-citation studies (studies cited together in later articles) (7,23,24). Furthermore, the ISI can also be used to eliminate self-citations from a citation count.
2-This study revealed that citation-based indicators are very useful in assessing the quality of published articles, but they should be deployed in more nuanced and open ways, alongside other metrics.
3-Institutes should consider including citation count as part of the evaluation for determining research priorities, allocating funding, and deciding tenures, promotions and appointments, and should consider lowering the productivity threshold to put more focus on the quality of published work.
4-This study indicated that including self-citations in the citation count is currently a minor problem when used as a proxy for importance or quality. However, it would be worthwhile to report self-citations alongside other metrics to identify and curb excessive self-citations in the future and flag potential self-promoters.
5-Those manipulating the peer-review process to amass citations to their own work should be identified and removed from the editorial board or banned as reviewers.
6-Editors should avoid artificially boosting impact factors by encouraging the citation of a journal's own papers.

Limitations of this study
The citation classics list may be criticized on a few accounts. By including articles published in general medical journals and not including subspecialty journals, several highly cited articles have been excluded from the list (25)(26)(27)(28)(29). Furthermore, the ISI has been reported to sporadically miss citations from before 1980. The indexing system has also been shown to have discrepancies, compared to the original publication, in at least one data field in 10% of the published articles (30). In the case of discrepancies, the original publication's data were extracted.